Author: Ben Garrett Created Date: 03/23/2021

The following article is written as a Jupyter Notebook. All code can be run and tweaked by clicking the 'Open in Colab' button. Although the notebook can be opened in Colab, the dataset and models used in this demo are too large to run within Colab's free-tier hardware constraints.

Introduction

In this demonstration, I will compare several different models that can be used for time series analysis. Time series data is a unique type of data where observations are recorded at discrete points in time. Forecasting the future is one of the most relevant ways to utilize machine learning and statistics in a business context. These forecasts could be predicting future sales, demand load in an electric grid, the next word or sentence in a text auto-complete product, or anything else. In this project, I will use historical tide height measurements to build four types of time series forecasting models. Although empirical equations have already been developed to predict tides with acceptable accuracy, this dataset provides a great opportunity to test out different model architectures on a complex yet solvable mathematical problem with reasonable runtime for experimentation. Although this is a 'solved' problem, I learned that NOAA's tide predictions are not as perfect as I expected, with typical RMSEs of 20 to 150 mm (read more here). These errors will be used as a general benchmark for the models I build.

Models used in this project:

1) XGBoost gradient-boosted trees
2) Convolutional neural network (CNN)
3) Long short-term memory (LSTM) recurrent neural network
4) Autoregression (plus an autoregression-type model in Tensorflow)

These models will be built to predict hourly tide heights for the next 7 days given a chunk of historical observations.

Background

Tides are caused by gravitational forces of the moon and, to a lesser extent, the sun. Additionally, local tides are controlled by coastline geography and bathymetry (subsurface depth and shape). Local weather can also play a significant role as high and low pressure systems exert different displacement forces on the ocean's surface. Although we can predict moon and sun position with high accuracy, the ability to map the ocean's sub-surface at large scale and with high resolution is very new. NOAA predicts tides using years of historical data and fairly sophisticated sine-wave algorithms that are tuned to a local vertical sea level datum. More information about the current NOAA methodology can be found here:

https://tidesandcurrents.noaa.gov/restles1.html

Their method does not use machine learning. The goal of this project is not specifically to improve their method. This is a solved problem, and this project simply aims to demonstrate the behavior of several models for predicting multi-scale trends.

The Data

Tides have been recorded in many harbors for centuries. Thus, there is an abundance of available tide height data for most port cities around the world. I chose to use a dataset of hourly ocean height measurements since 1911 for the Victoria, BC harbor. The dataset source is:

http://uhslc.soest.hawaii.edu/data/

Caldwell, P. C., M. A. Merrifield, P. R. Thompson (2015), Sea level measured by tide gauges from global oceans — the Joint Archive for Sea Level holdings (NCEI Accession 0019568), Version 5.5, NOAA National Centers for Environmental Information, Dataset, doi:10.7289/V5V40S7W.

Get the Data

The data can be accessed from my Github page. The following code gets the tabular dataset. Tide measurements are in mm. The dataset contains many null/bad values (the value for bad data is -32767). The machine learning models used here take sequences of historical data and make predictions from them. Thus, as you will see below, historical sequences containing these bad values will be excluded from the training datasets. The autoregression model needs a much longer history than the others to produce similar accuracy, thus it was impossible to simply drop sequences with these values. For that model, the bad values were simply deleted from the training dataset. This is not ideal, and in a final, deployable system I would build a sub-model to interpolate the bad values before training the final model.
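As a sketch of that cleaning step with pandas (the file location and column names are omitted here; a tiny synthetic series stands in for the real record, and only the -32767 sentinel comes from the dataset's documentation):

```python
import numpy as np
import pandas as pd

BAD_VALUE = -32767  # sentinel the dataset uses for missing/bad readings

def mask_bad_values(series: pd.Series) -> pd.Series:
    """Replace the bad-data sentinel with NaN so it can be filtered or dropped."""
    return series.replace(BAD_VALUE, np.nan)

# Tiny synthetic example standing in for the real hourly tide heights (mm)
raw = pd.Series([7050, 7123, -32767, 6980])
clean = mask_bad_values(raw)
print(clean.isna().sum())  # → 1
```

Masking to NaN (rather than deleting rows outright) keeps the hourly index intact, which makes it easy to either drop affected training sequences or interpolate later.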

Here we will look at two charts:
1) Average tide height across all dates on a daily, weekly, annual and all time basis. This will show us global trends in the data.
2) Visualization of a random day, week, month of sequential data.
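A minimal sketch of how those averages can be computed with pandas' resample, using a synthetic hourly series in place of the real record:

```python
import numpy as np
import pandas as pd

# Synthetic hourly series standing in for the tide record (heights in mm);
# 12.42 h approximates the principal lunar tidal period
idx = pd.date_range("2020-01-01", periods=24 * 14, freq="h")
heights = pd.Series(
    7000 + 500 * np.sin(np.arange(len(idx)) * 2 * np.pi / 12.42),
    index=idx,
)

daily = heights.resample("D").mean()   # average tide height per day
weekly = heights.resample("W").mean()  # average tide height per week
print(len(daily))  # → 14
```

The annual and all-time aggregates for the first chart follow the same pattern with a yearly resample rule.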

XG Boost Model

XG Boost is a powerful machine learning library that provides an easy interface to several types of gradient-boosted models. Its boosted tree model in particular is remarkably performant, flexible, and fast on many supervised machine learning tasks.

To predict hourly tide heights for the next 7 days, we need to re-frame the dataset into a supervised format: an array X containing sequences of historical observations and, for each sequence, an array y containing the next 168 values. XG Boost does not natively support predicting multiple values for y, thus I built a MultiOutput class that takes in a model and its parameters, and builds a submodel trained to predict each future value in the sequence.
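A minimal sketch of that re-framing (the `make_supervised` helper and its window sizes are illustrative, not the author's exact code):

```python
import numpy as np

def make_supervised(series: np.ndarray, lag: int, horizon: int):
    """Frame a 1-D series as (X, y): each row of X holds `lag` past values,
    and each row of y holds the next `horizon` values."""
    X, y = [], []
    for i in range(len(series) - lag - horizon + 1):
        X.append(series[i : i + lag])
        y.append(series[i + lag : i + lag + horizon])
    return np.array(X), np.array(y)

X, y = make_supervised(np.arange(100.0), lag=24, horizon=7)
print(X.shape, y.shape)  # → (70, 24) (70, 7)
```

For the multi-output wrapper, scikit-learn's `MultiOutputRegressor` implements the same one-submodel-per-output strategy and could wrap an `XGBRegressor` directly as an off-the-shelf alternative to a custom class.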

Convolutional Neural Network

A CNN is a neural network architecture originally designed to abstract patterns from image data. A color image is simply a grid of pixels with associated values (colors) for each pixel. This data type is actually quite similar to a time series. In both, the location of an observation (moment in time for a time series, or x-y position in a pixel grid) correlates with its value. A CNN takes in a collection of arrays, and applies filters of specified size and shape to each array to abstract trends at different granularities. A standard 2D CNN takes in a 4D array. The first dimension is the number of batches (images, or in this case chunks of historical observations), the second and third dimensions are the pixel grid length and width, and the fourth is the attributes of each pixel (RGB for an image, tide height for this model). In order to use a 2D architecture, first we must reshape and batch the data. Keras also includes a 1D CNN class (flattened pixel grid) which would be even simpler to implement here since there would be less reshaping, but I chose a 2D network since this same architecture can be applied to image datasets. Neural networks generally train slowly and unpredictably when loss values are very large. Thus, before training the model, I scale all of the data to small values by dividing by the mean height of the training dataset.

To summarize, the below model takes in a 64 × 64 grid of hourly observations (about 6 months of history), applies 3 layers of 32 filters, and outputs a sequence of predictions for the next 7 days.
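A minimal Keras sketch of this architecture (the kernel sizes and the dense head are assumptions; only the 64 × 64 input, the 3 layers of 32 filters, and the 168-value output come from the description above):

```python
import tensorflow as tf

# 64 x 64 grid of hourly observations in, 168 hourly predictions (7 days) out
model = tf.keras.Sequential([
    tf.keras.Input(shape=(64, 64, 1)),
    tf.keras.layers.Conv2D(32, kernel_size=3, activation="relu"),
    tf.keras.layers.Conv2D(32, kernel_size=3, activation="relu"),
    tf.keras.layers.Conv2D(32, kernel_size=3, activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(168),  # one output per hour in the next 7 days
])
model.compile(optimizer="adam", loss="mse")
print(model.output_shape)  # → (None, 168)
```

Training would follow with `model.fit` on the reshaped, mean-scaled batches described above.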

Long Short-Term Memory Recurrent Neural Network

LSTM networks contain feedback connections that provide a means of retaining 'memory' of previously observed trends. Much like the above CNN architecture, the model takes in a batch of historical observations for each calculation step to develop its prediction, which is a sequence of 7 days of hourly tide heights. For each batch of historical observations, the label is a sequence of 168 observations that begins at the last time step of the batch. You could also structure the data so that for each time step within the batch there is an associated y sequence, to increase the number of training gradients. Doing so with this dataset would increase the number of values in the y_train array from 95 million to 30 billion! It could not live in memory, would take a long time to train, and you would have to utilize Tensorflow's tensor operations to window the data to avoid memory overflow. LSTM layers can often be combined with convolutional layers to achieve better performance. Additionally, Keras' Gated Recurrent Unit (GRU) layer often performs just as well as an LSTM yet runs slightly faster.
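A minimal sketch of such a sequence-to-vector LSTM in Keras (the 336-hour window and the 64 units are illustrative assumptions; only the 168-value output comes from the description above):

```python
import tensorflow as tf

# A window of past hourly heights in, 168 future hourly heights out
model = tf.keras.Sequential([
    tf.keras.Input(shape=(336, 1)),  # (timesteps, features per timestep)
    tf.keras.layers.LSTM(64),        # final hidden state summarizes the window
    tf.keras.layers.Dense(168),      # 7 days of hourly predictions
])
model.compile(optimizer="adam", loss="mse")
print(model.output_shape)  # → (None, 168)
```

Swapping `tf.keras.layers.GRU(64)` in for the LSTM layer is a one-line change, which makes the speed comparison mentioned above easy to run.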

Autoregression Model

Autoregression is a type of regression with only one variable. Instead of trying to find a relationship between two or more variables, an x and a y, autoregression looks at a 'lag' of historical observations to find a correlation between the value at t+1 and the values in the lag. This is simply a linear regression where each x is a sequence of historical observations and each y is the next observation. A true autoregression model is only able to make a single prediction one time step into the future. In order to build a system where we can predict the next 7 days' worth of tide heights, we have to:

1) Train the model to predict a single step ahead.
2) Append each prediction to the historical sequence.
3) Repeat until 168 hourly predictions (7 days) have been generated.

This requires a simple loop. The Statsmodels package, which has an easy-to-use autoregression API, exposes the fitted parameters in the model's params attribute. This is an array where the first item is the y-intercept and the remaining values are weights, with the first weight corresponding to the most recent historical observation. With the other models, we were able to ignore sequences with bad values in them. This is not easy to do with an autoregression model. The below code simply drops the null values, but in a final product it would be worth exploring interpolation methods to fill in the holes.

Autoregression-type Model in Tensorflow

To implement a model that behaves similarly to autoregression in Tensorflow, you could either use Tensorflow's Autoregressive class or, more interestingly, create a fully-connected neural network with no activation. The input to the network would be a 2D array (dataset length, lag) and the output would be a sequence of future predictions. Assuming the dataset and the dense layer are sufficiently large, this model should mimic the above autoregression model; however, unlike the loop above, it uses observations further than one step into the future to calculate training gradients. Being able to run this model on a GPU could be very advantageous compared to the above method if the dataset were any larger. Here is an example of a fully connected neural network:
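A minimal sketch of such a network in Keras (the 720-hour lag is an illustrative assumption; with no activation, the dense layer remains a purely linear map, mimicking autoregression):

```python
import tensorflow as tf

lag = 720  # assumed window length of past hourly observations (30 days)

# A flat window of `lag` past values in, 168 future values out.
# activation=None keeps the layer linear, like an autoregression.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(lag,)),
    tf.keras.layers.Dense(168, activation=None),
])
model.compile(optimizer="adam", loss="mse")
print(model.output_shape)  # → (None, 168)
```

Because all 168 outputs are trained jointly, gradients flow from every future step at once rather than only from t+1.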

Conclusion

This project demonstrates the use of many different model types to predict a real-world problem. However, none of the models presented here have been tuned. Neural networks contain a vast number of hyperparameters that can be adjusted to improve predictive performance. Based on the results of the above preliminary tests, it seems a CNN model is likely the best choice, and I would spend much more time experimenting with different configurations before deploying a model. For the neural networks, I would test different layer counts and widths, learning rates, optimizers, batch sizes, and input window lengths.

However, it is important to remember that training a model on better data is ALWAYS more impactful than finding the perfect architecture. More time should be spent inquiring into how the data was collected, what exactly the bad values represent, etc. Additionally, it is impossible to perfectly model the future. As discussed above, local weather and measurement error lead to variation that no model would ever be able to perfectly anticipate. Thus, it is often worth some time to model a probability range of future values, rather than a single number, to appropriately communicate this uncertainty to decision makers.